refactor: remove redundant partitioned_by_file_group file scan field by Phoenix500526 · Pull Request #23189 · apache/datafusion

Phoenix500526 · 2026-06-25T15:25:46Z

Which issue does this PR close?

Closes #23099 (Remove redundant partitioned_by_file_group file scan field).

Rationale for this change

FileScanConfig had two overlapping ways to declare a file scan's output
partitioning:

- partitioned_by_file_group: bool, a shorthand meaning "the file groups are
  organized by Hive partition column values, so the output is Hash-partitioned on
  those columns", and
- output_partitioning: Option<Partitioning>, a general, explicitly declared
  partitioning (added in #22657, Add ListingOptions::output_partitioning and
  FileScanConfig::output_partitioning for pre-defined file partitioning).

The bool is just a lazy shorthand for one specific output_partitioning value
(Partitioning::Hash over the partition columns), and every place that consumed
it (output_partitioning(), repartitioned(), create_sibling_state()) already
checked output_partitioning.is_some() || partitioned_by_file_group. Keeping both
is redundant and the ListingTable builder ended up setting both. This PR makes
output_partitioning the single source of truth.
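
To make the redundancy concrete, here is a minimal sketch of the pre-PR shape. The types `Partitioning` and `OldScanConfig` and the method names are simplified stand-ins invented for illustration (column indices replace physical expressions); they are not the real DataFusion definitions. The sketch only shows that the bool expands to one specific Hash value, so a config that sets the bool and one that declares the same Hash explicitly are indistinguishable to consumers.

```rust
// Simplified stand-ins for illustration; these are NOT the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    // Hash on the given column indices, spread over `usize` partitions.
    Hash(Vec<usize>, usize),
    // No declared partitioning, only a partition count.
    Unknown(usize),
}

// Hypothetical model of the pre-PR config with both overlapping fields.
struct OldScanConfig {
    partition_col_indices: Vec<usize>,
    file_group_count: usize,
    output_partitioning: Option<Partitioning>,
    partitioned_by_file_group: bool,
}

impl OldScanConfig {
    // The combined check the description says every consumer performed.
    fn has_declared_partitioning(&self) -> bool {
        self.output_partitioning.is_some() || self.partitioned_by_file_group
    }

    // The bool is expanded lazily into the one Hash value it stands for.
    fn effective_partitioning(&self) -> Partitioning {
        if let Some(declared) = &self.output_partitioning {
            return declared.clone();
        }
        if self.partitioned_by_file_group && !self.partition_col_indices.is_empty() {
            return Partitioning::Hash(
                self.partition_col_indices.clone(),
                self.file_group_count,
            );
        }
        Partitioning::Unknown(self.file_group_count)
    }
}
```

In this model, setting `partitioned_by_file_group: true` and setting `output_partitioning: Some(Partitioning::Hash(..))` over the same columns yield the same effective partitioning, which is the overlap this PR removes.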

What changes are included in this PR?

Following the issue's first option ("Remove partitioned_by_file_group"):

- Remove FileScanConfig::partitioned_by_file_group, the corresponding
  FileScanConfigBuilder field, and the with_partitioned_by_file_group builder
  method.
- ListingTable::scan now derives the partition-column Partitioning::Hash
  itself (once its file groups are finalized, so the partition count is correct)
  and passes it through the existing with_output_partitioning. The previous
  with_output_partitioning(declared) + with_partitioned_by_file_group(...)
  double-set is collapsed into one branch.
- hash_partitioning_from_partition_fields is made pub so ListingTable
  (a separate crate) can reuse the derivation instead of duplicating the
  column-index resolution.
- proto already round-trips output_partitioning, so no behavior is lost: the
  now-vestigial partitioned_by_file_group wire field is left unset on write and
  ignored on read. The field is kept in the .proto definition for backward
  compatibility.
- output_partitioning() / create_sibling_state() / repartitioned() now key
  solely off output_partitioning.
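
The derivation step in the second and third items can be sketched as follows. This is an assumption-laden illustration: `derive_hash_partitioning` is a hypothetical name (the real helper is hash_partitioning_from_partition_fields, whose signature is not shown in this PR text), the schema is modeled as a slice of column names, and `Partitioning` is a stand-in that holds column indices. The point is only that the column-index resolution and the partition count both depend on finalized inputs.

```rust
// Stand-in type for illustration; NOT the real DataFusion `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    // Hash on the given column indices, spread over `usize` partitions.
    Hash(Vec<usize>, usize),
}

// Hypothetical helper: resolve each partition column to its index in the
// table schema and pair the indices with the final number of file groups.
// Returns None when there is nothing to hash on or a column cannot be found.
fn derive_hash_partitioning(
    table_schema: &[&str],
    partition_cols: &[&str],
    file_group_count: usize,
) -> Option<Partitioning> {
    if partition_cols.is_empty() || file_group_count == 0 {
        return None;
    }
    // Collecting into Option<Vec<_>> yields None if any lookup fails.
    let indices = partition_cols
        .iter()
        .map(|name| table_schema.iter().position(|field| field == name))
        .collect::<Option<Vec<usize>>>()?;
    Some(Partitioning::Hash(indices, file_group_count))
}
```

Calling this before the file groups are finalized would bake in a stale partition count, which is why the description stresses doing the derivation afterwards.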

Are these changes tested?

Yes — by existing tests, updated to the new single-field model:

- datafusion-datasource: test_output_partitioning_with_partition_columns,
  test_output_partitioning_no_partition_columns,
  test_declared_output_partitioning_projects_with_scan, and the file_stream
  work-stealing test morsel_partitioned_by_file_group_keeps_files_local (which
  verifies that a declared output partitioning keeps each stream's files local).
- datafusion-proto: roundtrip_parquet_exec_output_partitioning (and the other
  roundtrip_parquet_exec_* cases) cover the partitioning round-trip. The old
  roundtrip_parquet_exec_partitioned_by_file_group test exercised the removed
  API and is dropped, as its coverage is subsumed by the output_partitioning
  round-trip test.
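
The wire-level behavior that the proto round-trip tests rely on can be modeled roughly like this. Everything here is hypothetical: `WireScanConfig`, `ScanConfig`, `to_wire`, and `from_wire` are invented stand-ins, not the generated datafusion-proto types or its conversion functions. The sketch shows the two rules stated above: the vestigial bool is left at its default on write, and its value is never consulted on read.

```rust
// Stand-in type for illustration; NOT the real DataFusion `Partitioning`.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Hash(Vec<usize>, usize),
}

// Hypothetical wire message: the old bool is still present in the schema.
#[derive(Debug, Clone, PartialEq, Default)]
struct WireScanConfig {
    partitioned_by_file_group: bool, // vestigial, kept for compatibility
    output_partitioning: Option<Partitioning>,
}

// Hypothetical in-memory config after this PR: a single field.
#[derive(Debug, Clone, PartialEq)]
struct ScanConfig {
    output_partitioning: Option<Partitioning>,
}

// Write path: the vestigial bool is left unset (its default, false).
fn to_wire(config: &ScanConfig) -> WireScanConfig {
    WireScanConfig {
        partitioned_by_file_group: false,
        output_partitioning: config.output_partitioning.clone(),
    }
}

// Read path: only output_partitioning is consulted; the bool is ignored.
fn from_wire(wire: &WireScanConfig) -> ScanConfig {
    ScanConfig {
        output_partitioning: wire.output_partitioning.clone(),
    }
}
```

In this model a config survives `from_wire(&to_wire(&config))` unchanged, while a message that carries only the bool decodes to a config with no declared partitioning.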

All of the above pass, along with cargo fmt --all --check and
cargo clippy --all-targets --all-features -- -D warnings for the affected
crates.

Are there any user-facing changes?

Yes, public API changes:

- Removed: the public FileScanConfig::partitioned_by_file_group field and the
  FileScanConfigBuilder::with_partitioned_by_file_group method. Callers should
  set with_output_partitioning(Some(Partitioning::Hash(..))) instead (or use the
  now-public hash_partitioning_from_partition_fields helper).
- Added: hash_partitioning_from_partition_fields is now pub.
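
For callers, the migration amounts to replacing the bool setter with an explicit declaration. The sketch below uses a hypothetical `ScanConfigBuilder` and a stand-in `Partitioning` that holds column indices; the real FileScanConfigBuilder takes other arguments and the real Partitioning::Hash holds physical expressions, so treat this as the shape of the change rather than compilable DataFusion code.

```rust
// Stand-in types for illustration; NOT the real DataFusion API.
#[derive(Debug, Clone, PartialEq)]
enum Partitioning {
    Hash(Vec<usize>, usize),
    Unknown(usize),
}

struct ScanConfig {
    file_group_count: usize,
    output_partitioning: Option<Partitioning>,
}

impl ScanConfig {
    // After the PR, this is driven by the single stored field.
    fn output_partitioning(&self) -> Partitioning {
        self.output_partitioning
            .clone()
            .unwrap_or(Partitioning::Unknown(self.file_group_count))
    }
}

// Hypothetical builder with only the surviving setter.
struct ScanConfigBuilder {
    file_group_count: usize,
    output_partitioning: Option<Partitioning>,
}

impl ScanConfigBuilder {
    fn new(file_group_count: usize) -> Self {
        Self { file_group_count, output_partitioning: None }
    }

    // Replaces the removed with_partitioned_by_file_group(true) call.
    fn with_output_partitioning(mut self, partitioning: Option<Partitioning>) -> Self {
        self.output_partitioning = partitioning;
        self
    }

    fn build(self) -> ScanConfig {
        ScanConfig {
            file_group_count: self.file_group_count,
            output_partitioning: self.output_partitioning,
        }
    }
}
```

A caller that previously wrote the equivalent of `.with_partitioned_by_file_group(true)` would, in this model, write `.with_output_partitioning(Some(Partitioning::Hash(vec![2, 3], 4)))` with the partition-column indices and the final file group count.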

Query results, optimizer decisions (e.g. eliding RepartitionExec), and the
serialized (proto) wire format are unchanged. There is one display-only
change: EXPLAIN now renders output_partitioning=Hash(...) on DataSourceExec
for partition-grouped scans. The scan already produced that partitioning before
(it was derived lazily inside output_partitioning()); it is now stored on the
output_partitioning field and therefore shown. The
repartition_subset_satisfaction and preserve_file_partitioning slt expected
plans are updated accordingly.
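
Why the EXPLAIN text changes while behavior does not can be illustrated with a small Display sketch. `ScanDisplay` and its output format are invented for illustration and are much simpler than what DataSourceExec actually prints; the only assumption carried over from the description is that the display renders output_partitioning when the field is stored and stays silent when the value exists only as a lazily derived result.

```rust
use std::fmt;

// Hypothetical, heavily simplified stand-in for the scan's display state.
struct ScanDisplay {
    file_group_count: usize,
    // Stored declared partitioning, pre-rendered as text for brevity.
    output_partitioning: Option<String>,
}

impl fmt::Display for ScanDisplay {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "DataSourceExec: file_groups={}", self.file_group_count)?;
        // Only a stored value is rendered; a lazily derived one never
        // reaches this code path, which is why it was invisible before.
        if let Some(partitioning) = &self.output_partitioning {
            write!(f, ", output_partitioning={partitioning}")?;
        }
        Ok(())
    }
}
```

Before the PR the partition-grouped scan corresponds to `output_partitioning: None` in this model (the Hash was computed on demand), and afterwards to `Some(..)`, so the same plan gains one extra entry in its rendered line.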

cargo-semver-checks will flag the removals as breaking, which is expected for
this cleanup.

github-actions · 2026-06-25T15:32:07Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-catalog-listing v54.0.0 (current)
       Built [  34.926s] (current)
     Parsing datafusion-catalog-listing v54.0.0 (current)
      Parsed [   0.010s] (current)
    Building datafusion-catalog-listing v54.0.0 (baseline)
       Built [  34.538s] (baseline)
     Parsing datafusion-catalog-listing v54.0.0 (baseline)
      Parsed [   0.010s] (baseline)
    Checking datafusion-catalog-listing v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.092s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  70.588s] datafusion-catalog-listing
    Building datafusion-datasource v54.0.0 (current)
       Built [  28.565s] (current)
     Parsing datafusion-datasource v54.0.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-datasource v54.0.0 (baseline)
       Built [  28.556s] (baseline)
     Parsing datafusion-datasource v54.0.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-datasource v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.287s] 223 checks: 221 pass, 2 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  FileScanConfigBuilder::with_partitioned_by_file_group, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/1c6f8a0256807fdae23280a2d0c22cfdac108e76/datafusion/datasource/src/file_scan_config/mod.rs:520

--- failure struct_pub_field_missing: pub struct's pub field removed or renamed ---

Description:
A publicly-visible struct has at least one public field that is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/struct_pub_field_missing.ron

Failed in:
  field partitioned_by_file_group of struct FileScanConfig, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/1c6f8a0256807fdae23280a2d0c22cfdac108e76/datafusion/datasource/src/file_scan_config/mod.rs:211

     Summary semver requires new major version: 2 major and 0 minor checks failed
    Finished [  58.498s] datafusion-datasource
    Building datafusion-proto v54.0.0 (current)
       Built [  45.319s] (current)
     Parsing datafusion-proto v54.0.0 (current)
      Parsed [   0.015s] (current)
    Building datafusion-proto v54.0.0 (baseline)
       Built [  45.347s] (baseline)
     Parsing datafusion-proto v54.0.0 (baseline)
      Parsed [   0.016s] (baseline)
    Checking datafusion-proto v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.296s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [  91.959s] datafusion-proto
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 140.741s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.018s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 140.075s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.089s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 283.454s] datafusion-sqllogictest

`FileScanConfig` had two overlapping ways to declare file scan output partitioning: the `partitioned_by_file_group` bool and `output_partitioning`. Collapse them onto `output_partitioning` as the single source of truth.

- Remove the `partitioned_by_file_group` field, the builder field, and the `with_partitioned_by_file_group` builder method.
- `ListingTable` now derives the partition-column `Partitioning::Hash` once its file groups are finalized and passes it via `with_output_partitioning`; `hash_partitioning_from_partition_fields` is made `pub` for this.
- proto already round-trips `output_partitioning`, so the now-vestigial wire bool is left unset on write and ignored on read (the proto field is kept for backward compatibility).

Closes apache#23099.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>

After collapsing `partitioned_by_file_group` onto `output_partitioning`, the declared Hash partitioning is now stored on the scan and therefore rendered by `DataSourceExec`'s Display. Update the affected sqllogictest expected plans accordingly. Behavior is unchanged; only the EXPLAIN text gains an `output_partitioning=Hash(...)` entry on partition-grouped scans.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>

github-actions Bot added the catalog (Related to the catalog crate), proto (Related to proto crate), and datasource (Changes to the datasource crate) labels Jun 25, 2026

github-actions Bot added the auto detected api change label Jun 25, 2026

Phoenix500526 added 2 commits June 26, 2026 10:14

Phoenix500526 force-pushed the issue/23099 branch from e9ed249 to 30a0d3c Compare June 26, 2026 02:14

github-actions Bot added the sqllogictest (SQL Logic Tests (.slt)) label Jun 26, 2026

Phoenix500526 wants to merge 2 commits into apache:main from Phoenix500526:issue/23099
